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ABSTRACT 

We present a new QSO selection algorithm using a Support Vector Machine (SVM), a supervised 
classification method, on a set of extracted time series features including period, amplitude, color, and 
autocorrelation value. We train a model that separates QSOs from variable stars, non-variable stars 
and microlensing events using 58 known QSOs, 1,629 variable stars and 4,288 non- variables using the 
MAssive Compact Halo Object (MACHO) database as a training set. To estimate the efficiency and 
the accuracy of the model, we perform a cross-validation test using the training set. The test shows 
that the model correctly identifies ^80% of known QSOs with a 25% false positive rate. The majority 
of the false positives are Be stars. 

We applied the trained model to the MACHO Large Magellanic Cloud (LMC) dataset, which 
consists of 40 million lightcurves, and found 1,620 QSO candidates. During the selection none of the 
33,242 known MACHO variables were misclassified as QSO candidates. In order to estimate the true 
false positive rate, we crossmatched the candidates with astronomical catalogs including the Spitzer 
Surveying the Agents of a Galaxy's Evolution (SAGE) LMC catalog and a few X-ray catalogs. The 
results further suggest that the majority of the candidates, more than 70%, are QSOs. 
Subject headings: Magellanic Clouds - methods: data analysis - quasars: general 



1. INTRODUCTION 

A large catalog of Quasi-stellar object (QSO) is impor- 
tant for a variety of fields in modern astrophysics and ob- 
servational cosmology. QSOs have been used for studies 
of a) large scale structures based on the spati al clustering 
of QSOs ( |Shen et al.|[2007| |Ross et al.|[2009| , b) growth 
of centr al plack holes usmg th e estiniated black holes' 



holes 



masses (Kollmeier et al. 
and their 



host 



ga. 



2006), c) coevolution of black 



axies using lensed QSO hosts 
( Peng et al.|2006 l, d ) the epoch of reionization based on 
high redshifl QSOs ( [Becker et al.|2001[ [Fan et al.||2006[ ), 
e) dar k matter substructure using gravitationally lensed 
QSOs ( [Metcalffc Madau|2001[ [Miranda fc Maccio|2007 l 
and f ) properties ot the iiitergalactic medium determined 
by me asuring metallicity distribution using QSO spectra 
( [Viel et al. 2002 ; Simcoe et al. 2004 ) . 

One ot the most mterestmg properties of QSOs is the 
strong flux variation over a wi de range of wavelengths on 
timescales from days to years ( Hook et al.|1994 Hawkins 



2002 and references therein). It is believed that QSO 



variability is associated with accret ion disk instabilities 



(Rees 1984 



other possi 



Kawaguchi et al. 1998) although there are 
3le~explanations to r the s ource of QSO vari- 
ability, incl uding microlensing (Hawkins 1993; Zackrisson 
et al.''2003), s tarbursts and supernovae ( ^Terlevich et al. 
TI)92 ; Aret xaga et al.|1997 1 . It is debatable which mech- 
anism is the dominant source of variability (see Hook 
erar[[T994l [ Giveon et al.||1999i IVan den Berk et al.|M^ 
de Vries et al. 2005; Bauer et al. 2009|). Moreover, due to 



the lack ot long-time-span, well-sampled and high-quality 
QSO lightcurves, all these previous studies have investi- 
gated ensemble variabilities of QSOs. Thus it is impor- 
tant to have a large set of well-sampled QSO lightcurves 



in order to study both ensemble and individual QSO vari 
ability characteristics, which will help constrain th e the 
oretical models of the variability mechanisms (see Hook 



et al.|1994[|Cristiam et al.|1996[[Vanden Berk et al. 
and references therein). 

Many authors have attempted to select QSO candi- 
dates based on th e variability characteristics. For in- 
stance, Eyer[ j2002[ ) sele cted QSO candidate s from 68,000 
OGLE-II variable ^tars ( Zebrun et al.pOOl ) using colors, 
magnitudes and the structure function ot the variables. 
The structure function determines the time scale of vari- 
ability in a given lightcurve as a function of the time lag 
between observations (|Eyer^2002J . Among the selected 
133 QSO candidates, ^10% were confirmed to be QSOs 



pobrzycki et al.[[2002 



2005 ). [Geha et al.[ ( [2003[ ) (here- 

inatter G03) searched l'4U 7K7D MACHO sourc esthat have 



significant flux variation ( Alcock et al.|[2000 ). G03 used 
colors, magnitudes and two statistical parameters that 
quantify variability to select QSO candida tes. G03 then 
remo ved known MACHO variable stars (Alcock et al. 
2001 1 from the candidate list and finally examined the 



remaining candidates manually in order to remove false 
positives. Among the final 360 candidates, 259 were spec- 
troscop ically observed and 47 of them confirmed to be 
QSOs. [Sumi et al.[ ([20051 searche d about 200,000 vari- 
able objects ot the OGLE-II data (Wozniak et al.[[2002| 



and then used a few selection cuts such as magnitudes 
structure function and manual validation. No spectro- 
scopic observation was done for their final 97 QSO can- 
didates. 

Recently, four QSO selection methods have been sub- 
mitted or published, which proposed new QSO clas- 
sification algorithms using time series variability fea- 



2 



tures. One of them is the work done by| Kozlowski et al 



( 2010 ) that used a stochastic model shown in Kelly et a 



(20U9) which derives the amphtude and the time scale 
ot iignt curve variations. They also employed periods 
of hghtcurves and magnitudes. To develop their selec- 
tion method, they used the known QSOs, periodic vari- 
ables and non-period i c vari ables in the OGLE databases 
( [Udalski et al lflQQT', ^2 ^^. They also used QSO can- 
didates from [koziowski &; Kochanek (2009) that had 
OGLE counterparts. To separate the QSOs from other 
variables, they defined several cuts and correctly identi- 
fied 63% of the QSOs wh ile removing most of the variable 
stars. The second study (Schmidt et al. 20101 proposed a 
power-law model to fit the structure function and derived 
the amplitude and the power index of the model. They 
used the derived parameters to isolate known QSOs from 
RR Lyraes and non-variable stars extracted from the 
SDSS stripe 82 database (S82) (Sesar et al..2007) . Using 
simple cuts on the amplitude versus power index plane, 
they identified abo ut 90% of the SPSS Q SOs with a 5% 



false positiv e rate. Butler & Bloom (2010) and 



MacLeod 



et al. (2010) used similar approaches (i.e. structure func- 
tion) with the previous two works. Both utilized the 
preselected variable sources from the S82 dataset where 
the majority of the variables are QSOs, RR Lyraes a nd 



stars fro m the stellar locus (see Sesar et al. 2007 for 
details). 'Butler & Bloom (2010) parameterized the en- 
senible QSO structure function as a function of bright- 
ness of the QSOs. They then used the parameterized 
ensemble QSO model to eval uate the quasar likeliho od 
for individual lightcurves (see Butler & Bloom 2010 for 
details). Using this method, they identified nearly all 
the known SDSS QSOs (9 9%) with a 3% false positive 
rate. iMacLeod et al. (2010) also used the structure func- 



tion and several cuts to identify QSOs and exclude other 
variable stars from the S82 database. They correctly se- 
lected about 90% of the QSOs with 10~20% false positive 
rate depending on the cuts imposed. Both works also se- 
lected n ew QSO candidates from the preselected variable 



sources (Sesar et al. 



2007). These candidates have not 



been spectroscopically confirmed. Note that the efficien- 
cies or false positive rates of these studies should not 
be directly compared because each work used their own 
selected set of stars and QSOs to develop their meth- 
ods. For a com prehensive comparison of the results of 
the methods, see MacLeod et al. (2010). 



Even tho ugh some ot thes e recent works ( Schmidt et al. 
|2010; Butl er fc Bloo m 2010; MacLeod et aipUlOj ) showed 
iTigh efficiencies and low false positive rates, they used 
samples that are selected in such a way that high ef- 
ficiency and low false positive rate is to be expected. 
The separation of QSOs from non-varying stars and a 
few types of variable stars, especially short-period vari- 
ables (i.e. RR Lyraes) are relatively straightforward since 
QSOs show non-periodic and long-time scale fluctuation. 
The majority of the samples they used in these stud- 
ies are short-period variables and do not show long-time 
scale fluctuation, 

QSO selection methods based on variability will be 
valuable tools for on-going and future large s cale survey 
missions such as P an-STARRS ( |Kaiser||2004| and LSST 
(Ivezic et al. 2008). These surveys will keep monitoring 
wide areas of the sky and will produce vast amount of 
time series data in several wavelength bands (e.g. g, r, i, z 



for Pan-STARRS). Because spectroscopic observations 
for such wide areas are very expensive, QSO selections in 
the absence of spectroscopic data are becoming impor- 
tant, and thus developing QSO selection methods using 
variability are rapidly attracting notable attention. 

The work presented in this paper utilizes the whole 
MACHO lightcurve databset considering all known vari- 
able sources in the MACHO database. Thus this is the 
first work that considers the efficiency and the false posi- 
tive rates of QSO selection in an entire lightcurve dataset. 
We have developed our method by training on the rich- 
est possible dataset including all known types of sources 
and testing it also on the whole dataset. The training 
set includes a variety of variable objects such as QSOs, 
RR Lyraes, Cepheids, eclipsing binaries, long period vari- 
ables, Be stars, microlensing events and also non- variable 
stars. Only one other selection method, |Kozlowski et al.| 
( 2010 1 , has considered Be stars, which are one of the most 
significant contaminants during QSO selections in LMC 
(G03). Our goal is to select hig h confidence QSO candi- 
dates in the MACHO database ( [Alcock et al.|1996[ ) while 
minimizing the number of false positives. Our approach 
employs multiple time series features rather than using 
only the lightcurve structure function. These features 
can characterize various kinds of variability characteris- 
tics. Therefore our algorithm is practical not only for 
identifying QSOs but also for excluding other types of 
variable stars and non- variable stars. To fully utilize the 
features and identify QSOs, we employed a supervised 
machine learnin g classification method, Support Vector 
Machine (SVM, |iBoser et al.||1992| |Cristianini fc Shawe- 
Taylor 2000 Panik||2dU^ ). In the true spirit of machine 



learning, our method uses a classification model trained 
with the training set and thus eliminates the need for 
hard linear cuts and human input (e.g. manual prese- 
lection of variable sets, manual removal of false positives 
and determination of cuts). 

We briefly introduce the MACHO database and known 
MACHO QSOs in Section [2] Section |8] describes the 
multiple time series features that we used to quantify 
the variability characteristics of each lightcurve. Section 
H] introduces SVM, the method used to train the classi- 
fication model. We present the MACHO QSO classifi- 
cation model constru cted using the time series features 
and SVM in Section [O We then show the MACHO 



QSO candidates selected using the model in Section 5.2 



Crossmatched results with astronomical catalogs are pre- 
sented in Sectio n [6} Ongoing and future work is summa- 
rized in Section [7| 

2. MACHO DATABASE AND MACHO QSO 

2.1. MACHO database 

The MACHO survey monitored a wide area of the sky 
to detect microlensing events caused by Milky Way halo 
objects and test the hypothesis that a significant portion 
of dark matter in the Milky Way halo consists of com- 
pact objects such as brown dwarfs or planets (Alcock 



et al. 1996). Because microlensing events are extremely 
rare, MACHO monitored several tens of millions of stars 
in the Large Magellanic Cloud (LMC), Small Magellanic 
Cloud (SMC) and Galactic bulge for 7.4 years. Obser- 
vations started in July 1992 and were completed at the 
end of December 1999. More than 5 Tbytes of image 
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TABLE 1 

11 TIME SERIES FEATURES 



Four new features Brief description; for details, see Appendix. 



and A^i,, 



Stetson Ka_c 



Rc 



Nabove'- the number of points above the upper boundary line of the autocorrelation plot. 

Nbeiow - th e number of points below the lower boundary line of the autocorrelation plot. 

Figure [llj shows the constructed boundary lines based on the autocorrelation functions (see Figure [lo| 
of the traming set lightcurves. 

Stetson K (Eq. [n} variability index derived based on the autocorrelation function 
of each lightcurve. 



The range of a cumulative sum ( |Ellaway|l978| 



Seven other features Brief description; for details, see Appendix. 



a/m 



The ratio of the standard deviation, <t, to the mean magnitude, m. 



Period and Period S/N Period and period signal-to- noise ratio of each lightcurve. 

Derived using Lomb-Scargle algorithm and Lomb periodogram I Lomb|l976 Scargle|l982 i 



Stetson L 
V 

B - R 
Con 



The variability index ( Stetson|l996 1 describes the synchronous variability of different bands. 



The ratio of the mean of the square of successive differences to the variance of data points 
in each lightcurve. 

The average color for each lightcurve. 

The number of consecutive data points that are brighter or fainter than 2a of each lightcurve. 



data and 70,000 exposu res were collected during the pe- 
riod ( |Alcock et al |2000[ ). In addition, MACHO used two 
bands (MACHU a and R) for the observations. 

2.2. MACHO QSOs 

There are in total 59 known QSOs in the MACHO 
database (50 in the LMC fields and 9 in the SMC fields; 
hereinafter MACHO QSOs). Forty-seven were detected 
by G03 and the remaining t welve were QSOs previously 



known from other stu dies ([Blanco fc Heathcote 1986 
Schmidtke et al.||1999[ |Do brzycki ef al. 2002):G(«3e- 
tected 38 of them using variability characteristics of MA- 
CHO lightcurves and nine of them by crossmatching with 
X-ray and radio catalogs. To select QSO candidates, GOB 
applied simple cuts such as color, magnitude and ampli- 
tude on 140,000 pres elected MACHO s ources that show 
strong flux variation ( Alcock et al. 2000 ) . The lightcurves 
of 12 previously known QSOs were used as references 
for the variability cuts. After selecting 2,500 QSO can- 
didates from the 140,000 sources, G03 removed known 
MACHO variable stars from the candidate list and then 
manually examined the remaining candidates to elim- 
inate false positives. They eventually removed about 
2,140 candidates and confirmed that the majority of 
the removed candidates were objects with quasi-periodic 
variability such as blue variable stars. Blue variable 
stars typically show strong Balmer emissi on lines and 
are th ought to be associated with Be stars (Keller et al. 
2002 1. It is also know n that Be stars show variability 
similar to QSO s (Eyer 20021 iGe ha et al.||2003[ [Mennick- 
ent et al. 2002j Keller et al. 200^^ Using spectroscopic 
instruments, G03 observed 259 candidates selected from 
the remaining 360 candidates and also the candidates se- 
lected using the catalog crossmatchings. G03 confirmed 
47 new QSOs with magnitudes 16.63 < my < 20.10 and 



redshifts between 0.28 and 2.77. 

G03 analyzed only 30 of the 82 MACHO LMC fields, 
and thus the remaining 52 MACHO LMC fields have not 
been searched for QSOs. Moreover, they selected QSO 
candidates from the preselected 140,000 variable sources 
and did not analyze the remaining several tens of mil- 
lion lightcurves. Thus it is very likely that there are a 
lot more QSOs that have not been detected yet. In the 
following sections, we introduce a new QSO selection al- 
gorithm to detect these non-identified QSOs in the MA- 
CHO LMC database. 

3. TIME SERIES FEATURES 

In order to separate QSOs from non- variable stars and 
variable stars, we quantify the variability characteristics 
of lightcurves using 11 time series features. These 11 
features were independently proposed to quantify cer- 
tain types of variability features including amplitudes, 
periods, colors and distribution of data points. They can 
complement each other because they pick out different 
variability features. Thus, by using these multiple fea- 
tures, we can identify various types of variability char- 
acteristics (e.g. non-varying sources, periodic variables 
and non-periodic variables). Note that we selected these 
time series features not only for characterizing QSO time 
series but also for characterizing other types of variable 
sources or non- variable sources because we want to iden- 
tify QSOs while excluding the other types of sources at 
the same time. We briefly describe these 11 time series 
features in Table [l] See Appendix for details about the 
features consisting of four new features that we have de- 
veloped for this work and seven previously used features. 

Figure [l] shows scatter plots of all 11 time series fea- 
tures. Different colors and symbols denote different types 
of sources. The red squares are QSOs, the blue crosses 
are Be stars, the magenta crosses are microlensing events. 
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RR Lyraes 


X Eclipsing binaries 


+ Be Stars 


+ Microlensing 


X Cepheids 


None variables 




stetson stetson A"^c 



Fig. 1. — Scatter plots of the 11 time series features. The axis of each panel is different time series feature. As the panels show, each type 
of variables is clustered in certain areas. 1) The top left panel: each type of the periodic variables are clustered at each different area. It 
also shows one day or multiple days period aliases caused by MACHO's observational nightly pattern. 2) The top right panel: rj is relatively 
small for QSOs, Be stars and LPVs which have positive autocorrelation. Color (i.e. difference between average ma gnitude of MACHO 
B a nd R bands) is useful in separating QSOs from some other types of variables as several other studies suggested jGiveon ct al. 1999] 
|Ey cr 2002 Geha et al.|2003l l. 3) The middle left panel: N above vs Ni^^i^^. The panel shows almost none of the non-variables and periodic 
variables except LPVs because they do not have data points above (below) the boundary lines by construction. 4) The middle right panel: 
Res is relatively larger for QSOs. cr/m also separates some variable types. For instance. Be stars have relatively smaller values of a/fh 
than QSOs. 5) The bottom left panel: Stetson L is effective in separating any type of variables from non-varialsles except microlensing 
events while Stetson Kj\c is practical to separate QSOs, Be stars and LPVs from others. 6) The bottom right panel: Con can be used to 
separate non-variables from others because non-variables have relatively smaller Con than the others. For details about each feature, see 
the text and Appendix. 
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the cyan crosses are LPVs, the green x's are Cepheids, 
the yellow x's are RR Lyraes, the black x's are eclips- 
ing binaries and the gray dots are non-variables. As 
each panel shows, not only QSOs but also other types 
of variables are clustered in certain areas, which means 
each time series feature is good at separating some of 
the variable types. Thus we did not implement a feature 
selection algorithm that removes uninformative features. 
See Section^ for a brief explanation about a general fea- 
ture selection concept. We selected a subset of MACHO 
lightcurves of each variable type to derive the time series 
features shown in the figure. We also used the same sub- 
set to train the classification model for selecting MACHO 
QSO c andid ates. For details about the training set, see 
Section 15.11 

A simple and conventional method for selecting QSOs 
using these features is to define cuts in the 2D-space 
shown in Figure [T] motivated by empirical observations 
of known classes. However, each panel exhibits a unique 
and complex structure of the features, which suggests 
that defining simple cuts is difficult. Moreover note that 
each panel in the figure is a 2D projection of the original 
IID time series feature space. This implies that even if 
there exist proper cuts in the hyperspace that can sepa- 
rate the classes, these cuts could be obscured or invisible 
in any of the projections. Therefore, using simple cuts 
empirically derived from the projection could be inap- 
propriate for the classification. In order to alleviate the 
problem of introducing empirical cuts and thus to fully 
utilize the derived 11 time series features, a classification 
algorithm should be capable of defining boundaries (e.g. 
cuts) in the hyperspace. For this purpose, we employed 
SVM which produces hyperplanes between classes in any 
multi-dimension space. SVM also can define non-linear 
boundaries using kernel functions while cuts are gener- 
ally linear. In the following section, we briefiy explain 
SVM. 

4. SUPPORT VECTOR MACHINES 



SVM (Boser et al. |1992 ) is a family of supervised ma- 
chine learning algorithms that can train a 2-class classi- 
fication model using samples of two known classes (i.e. 
training data). A SVM classifier can be seen as a single 
node neural network with an implicitly defined high di- 
mensional feature space. It is currently one of the best 
classification methods in machine learning. Compared 
to neural networks SVM provide a fiexible classification 
model, avoid the problems of local minima, and reduce 
the need for parameter tuning. Several efficient optimiza- 
tion methods have been developed for SVM training in 
recent years. For an overview, discussion and practi- 
cal details the reader is referred to [Cristianini fc Shawc- 



Taylorl ( [2000| ); [Bennett fc Campbell] (|2000|; jHsu et al 



( |2003] 

SVM have been applied extensively in many appli- 
cation areas, and in particular to various astronomical 
applications such as th e classification of variable stars 
(Wozniak et al. 2004b), the sel ection of Active Galac - 
tic JNuclei (AGIN) candidates ( [Zhang fc Zh ao| |2004|^ 



the d etermination of photometric redshitt |Wadadckar 
2005 1 , th e classification of galaxies using synthetic galaxy 
spectra (Tsalmantza et al. 20071 and the morphologi- 
cal classificatio n ot ga laxies using image data ( jHuertas-| 
Company et~al][2008[ ) . 



The classifier of a SVM defines a linear hyperplane that 
separates two classes in a training set. To select a unique 
hyperplane among the set of possible hyperplanes that 
separate the data, SVM chooses the hyperplane which 
maximizes the margin between the two classes, and is 
therefore often called the maximum margin separator. 
However, in many cases, it is not possible to find any 
hyperplane that can perfectly separate two classes. In 
other words, a training set of two classes cannot be sep- 
arated without errors. In order to solve this problem, 
soft margin SVM which allows errors i n a training set 
(i.e. misl abeled samples) was proposed (Cortes & Vap- 
nik 19951. The soft margin SVM uses a modified opti- 
mization criterion where a constant, C > 0, controls a 
tradeoff between maximizing the margin and minimizing 
the errors of a classification model. The parameter C 
needs to be selected appropriately in every application 
to balance the margin with the errors. Small C allows a 
large margin between two classes and thus tends to ignore 
mislabeled samples. On the other hand, large C allows a 
small margin and tries to separate even mislabeled sam- 
ples. Another approach to address non-separability is 
to map the examples into a (typically high dimensional) 
feature space where the data might be better separated. 
Such mappings are captured implicitly by SVM as well 
as several other learning methods. To achieve this, SVM 
employ non-linear kernel functions that capture inner 
products in the implicit feature space. Intuitively the 
kernel can also be seen to be a similarity function acting 
in the expanded space. When this is done the hypothesis 
of SVM has the form: 



Class{z) = sign(y^ aiy^K{z, Xj)) 



(1) 



where z is the example we are predicting the label for, Xi 
are the training data (i.e. the vectors of time series fea- 
tures), Ui are the labels for the ith training data, and i are 
indices for training examples. The ai are the parameters 
learned by the training procedure. The construction of 
SVM shows that this form captures a linear separator in 
the feature space for which K(z,Xi) is an inner product, 
and the training procedure chooses the ai that maximize 
the criterion of soft margin. Despite the mapping to a 
potentially high dimensional space, the maximum mar- 
gin criterion leads to automatic capacity control and thus 
avoids overfitting. 

Many forms of kernels exist in the literature, and the 
the most commonly used are the polynomial and the 
RBF (radial basis function) kernels. In this work, we 
followed standard practice (Hsu et al. 2003) and used 
the RBF defined as: 



K{xi,Xj) = exp(-7||xi 



7>0, 



(2) 



where Xi, Xj are two examples, and the kernel parame- 
ter 7 determines the width of the kernel function. The 
implicit feature space in this case is known to be of infi- 
nite dimension. As in the case of the parameter C, the 
value of 7 needs to be selected appropriately for the ap- 
plication. One can readily observe that this kernel mea- 
sures similarity between examples and that 7 controls 
how fast the similarity decays with respect to the dis- 
tance between the examples. Seen in this light, the clas- 



6 



sifier (EquationfTl) can also be seen to be a weighted form 
of nearest neighbor classification where the at weight the 
importance of training examples. 

It is well known that the choice of 7 and C can affect 
the results dramatically. In order to determine the best 
values for our application we used gri d search with th e 
10- fold cross-validation and technique ( Hsu et aL||2003 1 . 



• Cross-validation 

We divide each class into 10 subsets (i.e. 10-fold 
cross-validation) and select nine subsets to train 
a classification model. We then apply the trained 
model to the remaining subset and count the num- 
ber of true positives (i.e. number of QSOs that 
the model identifies as QSOs), the number of false 
positives (i.e. number of non-QSOs that the model 
identifies as QSOs) and the number of false nega- 
tives (i.e. number of QSOs that the model iden- 
tifies as non-QSOs). We repeat this process ten 
times with all different combinations. Finally we 
sum the true positives, false positives and false neg- 
atives from each iteration, and calculate the recall 
and precision defined as: 



Ntp . . Ntp , - 
recall = , precision — , 3 

Ntp + Nfn Ntp + Nfp 

where Ntp is the sum of the true positives, Npp is 
the sum of the false positives and Np ^ is the sum 
of the false negative^ 

• Grid search 

To select the best C and 7, we search in a log-scale 
evenly spaced 10x10 grid with values from 10~^ 
to 10^. We then perform a 10-fold cross-validation 
and select C and 7 that gave the best recall and the 
best precision. We then define a finer 10x10 grid 
and repeat the 10-fold cross-validation test with the 
new set of parameters. We repeat this procedure 
until recall and precision are not improving any 
more. 

Standard SVM does not provide probability o utput. 
Thus we employed Piatt's probability estimation (Piatt 
19991 to derive class probabilities. The Piatt posterior 



probability is calculated using a sigmoid function as: 



Pr{y = l\x) = 



1 



1 



(4) 



where / is a decision function such that sgn(/(a::)) de- 
cides the class of sample x. y is the label for sample x 
(i.e. a value for the class) and takes the values of -1-1 
or -1. As Piatt notes, this amounts to assuming that 
/ corresponds to the log-odds of the positive label; this 
assumption is not fully justified but has been shown to 
work well in many applications. The parameters A and B 
are calculated by minimizing the negative log-likelihood 
of a training data: 

^ False positive rate is 1 — precision = Npp /{Nj-p + Npp) 



min{- ^{t, \og{p^) + (1 - ti) log(l - p,))}. 



(5) 



U = 7^ , Pt 



1 



1 + e^f'+i 



where i are indices of training data, I is the total number 
of the training data and is a label for ith example. The 
derived Piatt class probabilities can be used to check the 
confidences of the predicted classes. 

Many authors have studied feature selection methods 
to remove irrelevant features (e.g. see 



Blum fc Langley 



1997| [B radley fc Manga sarian'19981 [Weston et al.||200i[ 
Li et al. 2003; Chen & Lin 2006 and references therein ) . 



Such feature selections could be useful when there are 
too many features (e.g. more than a few hundred) in- 
cluding both releva nt and irrelevant features. However, 
Nilsson et al. ( 2006 1 found that most known feature selec- 
tion methods occasionally discard even relevant features. 
This work also noticed that SVM is robust against unin- 
formative features as long as there are a sufficient number 
of informative features. Another reason for feature selec- 
tion is to reduce CPU time for extracting features and 
for training models when there exist a great number of 
features. Note that we employed only 11 time series fea- 
tures (see the previous section and Appendix) , and all of 
them are informative for separating some of the classes as 
shown in Figure [T] Thus it is not necessary to implement 
feature selection methods in this work. 

5. MACHO QSO CANDIDATE SELECTIONS USING SVM 
CLASSIFICATION MODELS 

5.1. Training Classification Models 

Using the 11 time series features and SVM, we trained 
a classification model for selecting MACHO QSO candi- 
dates. To train the model, we first selected a training 
set which consists of 58 MACHO QSOfQ 1,629 variable 
sources of known types (128 Be stars, 582 microlcns- 
ing events, 193 eclipsing binaries, 288 RR Lyraes, 73 
Cepheids and 365 LPVs) and 4,288 non- variable sources. 
We selected these variables from the list of known MA- 
CHO variable sources. Table [2] shows the number of the 
known MACHO variables we collected from S IMBAD's 
MACHO variable catalo^^ ( Alcock et aL]|200l|) and also 
from several literature sources"? Alcock et al. 1997c|a|b}_ 
Keller et alT] |2002| Thomas et al.||2005| f] 



Wood 2000 



ib select non-variable stars, we randomly chose a subset 
of MACHO lightcurves from a few MACHO LMC fields 
and removed all the known MACHO variables from the 
subset. 

We then derived the 11 time series features for indi- 
vidual MACHO lightcurves in the training set. Before 

^ We removed one MACHO QSO from the dataset because it 
has only 50 data points while the rest of the MACHO QSOs have 
at least several hundred data points. 

http: / / vizier.u-strasbg.fr/viz-bin / VizieR?-source=II/247 

* We added more than several thousands of new variable 
candidates selected in the MACHO LMC database to the ta- 
ble. These were identified by an another group at the Time 
Series Center, Initiative in Innovative Computing at Harvard 
(http://timemachine.iic.harvard.edu). The statistical characteris- 
tics ot the candidates will be separately published soon. For details 
about the selection algorithm, see Wachman et al. (2009^ 1. 
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TABLE 2 

Number of known MACHO variables 



TABLE 3 

Recall and precision during the cross-validation 



Variable types 



# 



References 



Band Recall Precision False Positive; ^ 



RR Lyraes 
Cepheids 
Eclipsing binaries 
LPVs 
Blue variables 
Microlensings 

Be stars 

RR Lyraes 
Cepheids 



9,722 
1,868 
6,835 
3,049 
1,262 
626 

136 

8,292 
1,452 



Alcock et al. ( 2001 1 
Xlcock et al. ( 2(M I 
Slcock et al. ( 200T I 
Wood (2000) 



jKeller et al. ( 2002 ) 
I Alcock et al. J 1997c a b| 
|l |l055| 



Thomas et 



private communication 
with Geha, M. 
from a separate work done 
by our group, 
see the text for details E 



Total 



33,242 



deriving the features, we removed all data points in each 
lightcurve with photometric errors greater than three 
times the average photometric errors]j The photomet- 
ric errors are gi ven by the MACHO photometric pipeline 
( [Alcock et aL||l999 ).' 

We then employed a 2-class classification SVlVjj us- 
ing the RBF. We empirically found that 2-class SVM 
with the RBF achieves better recall and precision than 
2- or multiple-class SVM with other kernels including 
linear kernel. We applied a 10-fold cross-validation and 
grid search to all the combinations of 2- or multiple-class 
SVM and different kernels. We found that 2-class SVM 
with the RBF showed the best recall and precision. To 
use a 2-class SVM, we defined the MACHO QSOs as the 
members of one class and all others as members of the 
other class. In order to derive the best C and 7, we per- 
formed a 10-fold cross-validation and grid search using 
the training set as described in the previous section. We 
performed the test on each MACHO band; one for the 
B band and one for the R band. Table |3] shows the de- 
rived best recall and precision of each band. As can be 
seen from the table, the B (R) model shows 82.8 (72.4)% 
recall and 75% precision, which means the B (R) model 
misses 17.2 (27.6)% of the MACHO QSOs and has 25 
(25)% false positive rate. For the B model, the false pos- 
itives consist of 12 Be stars, three microlensing events 
and one LPV; for the R model, 11 Be stars and three 
microlensing events. Although the majority of the false 
positives were Be stars as expected, the models excluded 
more than 90% of the 128 Be stars in the training set. It 
is worth mentioning that recall and precision could vary 
depending on which set of variables and non- variables we 
choose to use as a training set. For instance, if we exclude 
the 128 Be stars from the training set, we can increase re- 
call to 95% with a 7% false positive rate. We can further 
increase recall and precision if we also remove microlens- 
ing events and LPVs from the training set. However, 
note also that the higher recall and precision does not 
guarantee a better model because the model would not 
be able to distinguish QSOs from the false positives such 
as Be stars, microlensing events and LPVs when applied 
to the whole dataset. 

Finally, we trained two models, one each for the MA- 

^ SVM cannot consider errors of features while training a model. 
We used the LIBSVM package (Chang fc Lin^2001J . 
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Fig. 2.— Piatt probabilities for the known MACHO QSOs. The 
top (bottom) panel is the Piatt probabilities of the B (R) band 
lightcurves. 



CHO B and R bands, using the derived best C, 7 on 
the whole training selQ We used the trained models to 
select QSO candidates from the MACHO database (see 



Section 5.2 1. Although the rate of derived false positives 



mentioned in the previous paragraph is 25%, it should 
not be expected that the selected MACHO QSO candi- 
dates using the models would have 25% false positives. 
This is because the training set is not complete; also, it is 
nearly impossible to take into account every known type 
of variability existing in the MACHO database, which 
includes not only astronomical variables but also non- 
astronomical photometric defects or systematic errors. 
In addition, the fraction of QSO in the whole dataset 
is likely to be different than the training set. Thus the 
true false positive rate for the MACHO QSO candidates 
could be higher than 25%. We will come back to this 
point when we discuss crossmatching the candidate list 
with known catalogs in Section [6j 

In addition. Figure [2] shows the Piatt probabilities of 
the known MACHO QSOs for B (the top panel) and 
R (the bottom panel) band lightcurves. As the figure 
shows, the majority of the QSOs have higher probabil- 
ities than 80%. We used the Piatt probability of each 
MACHO Hghtcurve to select MACHO QSO candidates 
(see Section 5.2). 



5.2. MACHO QSO Candidate Selections 

To select the MACHO QSO candidates, we first de- 
rived the 11 time series features for the whole 40 million 

^ This model is slightly different from the one used for the cross- 
validation because it was trained on the whole training set as op- 
posed to 9/10 of the training set. 
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Fig. 3. — Example lightcurves of the QSO candidates. The x-axis 
is modified JuUan Date (MJD), and the y-axis is V magnitude, my. 
Each Ughtcurve manifests non-periodic and strong flux variation. 

MACHO LMC lightcurvesj^ We removed the data points 
in each hghtcurve which have photometric errors greater 
than three times the average photomet ric e rrors as we 
did during model training (see Section 5.1 1. We then 



apphed the trained models to each lightcurve and de- 
rived the QSO Piatt probability estimation. Finally we 
selected only the lightcurves which had the probability 
product of B and R bands higher than 25% (e.g. 50% 
probabilities in both B and R bands). Using the 25% 
cut, we selected 1,620 QSO candidates from the entire 
MACHO LMC database. We show example lightcurves 
of the QSO candidates in Figure |3] As the figure shows, 
all the lightcurves have strong andnon-periodic flux vari- 
ation, which is the variability characteristic of QSOs. 

Figure |4] shows recall and false positive rates corre- 
sponding to the probability product cuts on the training 
set. Using the 25% cut, we correctly identified 82.8% 
of the known MACHO QSOs (48 out of 58) with a 0% 
false positive rate. Although a probability cut lower than 
25% yields better recall and also a 0% false positive rate, 
we choose the 25% cut because our training set is not 
complete, as mentioned in the previous section. 

6. CROSSMATCHING RESULTS WITH INFRARED AND 
X-RAY CATALOGS 

In order to estimate the true false positive rate without 
spectroscopic confirmation, we crossmatched the candi- 
dates with other astronomical catalogs. In the following 
subsections, we present the crossmatching results and the 
false positive rate estimated on the basis of the cross- 
matched counterparts. 

6.1. Crossmatching with the Spitzer SAGE LMC catalog 



* If an object does not have a lightcurve of any particular band, 
we ignore that object. Nevertheless, almost all of the 20 million 
MACHO objects have both B and R band lightcurves, so the overall 
selection efficiency is not affected. 



It is known that mid-IR color selection is efficient at 
separating AGNs from other galaxies or stars because the 
spectral energy distributions o f these types are substan- 
tially differ ent from each other (iLaurent et al.|20 00 ; Lacy] 
et al.l2004||Trichas et al.|2010|feiitountzou et^al |20ldr 
Based on these characteristics, ILacy et al.l (|2004() a ' 



Based on these c haracteristics. Lacy et aL| ()2004| and 
Stern et al. ( 2005 ) introduced a mid-TR color cut to sep- 
arate AGlNs usmg the the Spitz er SAGE (Surveying the 
Agen ts of a Galaxy's Evolutio n; Me ixner et al.|2006 ) cat- 
alog. i Kozlowski fc Kochanek] ( |2009[ ) (hereinafter KK09) 
employed the mid-IR color cut and selected about 5,000 
AGN candidates from the Spitzer SAGE catalog. KK09 
also confirmed that the mid-IR color cut successfully 
identified most of the known QSOs in the SAGE foot- 
prints. 

To check whether our candidates are inside the mid- 
IR selection cut that KK09 used, we crossmatched them 
with the Spitzer SAGE LMC catalog containing 6 million 
mid-IR objects and found 1,239 counterparts. We first 
searched the nearest SAGE source from each of the can- 
didates within a 1" search radius. In order to minimize 
false crossmatchings, we defined the source as a counter- 
part only if there exist no other Spitzer sources within a 
3" radius from the candidate. 

Of the crossmatched counterparts, about 500 had been 
observed with at least three Spitzer IRAC (InfraRed Ar- 
ray Camera) bands. Note that we need a minimum of 
three Spitzer IRAC magnitudes to apply the mid-IR color 
cut. Figure [5] shows the color-color and color-magnitude 
diagrams 01 ;hese counterparts (529 in the color-color 
diagram and 544 in the color-magnitude diagram). The 
solid line in the figure shows the mid-IR color selection 
cut. KK09 suggested that the sources inside region B 
could either be AGNs or black bodies such as stars, while 
the sources inside region A are likely AGNs (left panel). 
In the color- magnitude diagram (right panel), there are 
two regions as well. The region labeled as YSO is thought 
to be dominated by young stellar objects (YSO) while 
the region labeled QSO is thought to be dominated by 
QSOs. Nevertheless, all the sources inside these four re- 
gions (AGN region) are potential QSOs. According to 
Stern et al. ( [2005| ), the candidates inside the AGN re- 
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gion are most likely broad emission line QSOs (i.e. Type 
1 AGNs) . Among them, the sources inside the QSO and 
A regions are the most promising QSO candidates. As 
the figure clearly shows, most of the crossmatched QSO 
candidates are inside the QSO (88.2%; 480 out of 544) 
and the A regions (76.9%; 407 out of 529), which implies 
that most of the candidates are likely true QSOs. The 
number of QSO candidates that are in both the QSO and 
the A regions are 391 out of 52{f] (73.9%). Under the as- 
sumption that all the 391 candidates are QSOs, the false 
positive rate is 26.1%, which is the upper bound of the 
false positive rate. There are only about 9% of the candi- 
dates outside the AGN region (9.3% outside A and B re- 
gions, 9.0% outside YSO and QSO regions), giving us the 
lower bound of the false positive rate. Nevertheless, we 
confirmed that most of the candidates outside the AGN 
region also show strong variability. We show example 
lightcurves of these candidates in Figure [6j As the figure 
shows, they have strong and non-periodic flux variation. 
Note that our method used variability characteristics of 
lightcurves in order to select QSO candidates which could 
be missed by the mid-IR color selection. Moreover, the 
mid-IR color cut is not very efficient a t selecting narrow 
emission line QSOs (Stern et al. 2005). Therefore some 
of the candidates could be either broad or narrow emis- 
sion line QSOs even though they are not inside the AGN 
region, which would further decrease the lower bound of 
the false positive rate. 

In addition, we also crossmatched the known MACHO 
QSOs and the 33,242 MACHO variables shown in Ta- 
ble [2| with the SAGE catalog to check how many known 
MACHO QSOs and known variables are inside the AGN 
region. Such variables inside the AGN region could be 
contaminants (i.e. false positives) for any mid-IR color 
selection method. We found about 50 counterparts with 
the known MACHO QSOs and about 3,900 counterparts 
with the variables. We also crossmatched about 200,000 
MACHO field sources from one randomly selected MA- 
CHO field with the SAGE catalog and found -10,000 
counterparts. These field source counterparts might con- 
sist of all types of objects including non-variable stars, 
unclassified variable stars and galaxies. Figure [7] shows 
all the crossmatched counterparts. The black squares 
are the MACHO QSO counterparts (48 in the color-color 
diagram and 49 in the color-magnitude diagram). The 
black crosses are the counterparts with the variables in- 
cluding RR Lyraes, Cepheids, eclipsing binaries, LPVs 
and blue variable stars (3,871 in the color-color diagram 
and 3,880 in the color- magnitude diagram). We sepa- 
rately depict eight Be stars as gray diamonds in the fig- 
ure. The gray dots are the MACHO field source coun- 
terparts (10,238 in the color-color diagram and 10,292 
in the color-magnitude diagram). As the figure shows, 
almost all of the MACHO QSOs are inside the AGN re- 
gion as expected. However, a few tens of the variables 
and the MACHO field sources are also inside the AGN 
region. We checked these variables in the AGN region 
and found that they consist of all types of the known 
MACHO variable stars such as RR Lyraes, Cepheids, 
eclipsing binaries, blue variables and LPVs. Moreover 
nearly all Be stars that have Spitzer counterparts are in- 

^ 529 is the total number of the Spitzer counterparts inside both 
color-color diagram and color-magnitude digram. 



side the region as well. It is known that Be stars are 
characterized by their IR emission due to dusty circum- 



stella r environments (Malfait et al. 1998 Leinert et al. 
20041). Also note that we crossmatched only 200,000 MA- 



CHO field sources with the Spitzer catalog. If we scaled 
our selection to the total MACHO LMC database cov- 
ering 20 million stars, more than several thousand field 
sources would be in the AGN region, providing signifi- 
cant contaminantion for QSO selection. According to the 
results, it seems that the mid-IR cut is not efficient for 
separating QSO candidates from various types of stars 
although it is practical for confirming QSO candidates, 
especially when applied to massive databases. In other 
words, the mid-IR selection cut shows relatively low pre- 
cision, although it shows high recall. Thus it is clear that 
algorithms based on the variability of lightcurves, includ- 
ing ours, are important for QSO candidate selections. 

6.2. Crossmatching with X-ray Catalogs 
We crossmatched our Q SO candidates wit h the Chan- 



dra X-ray source catalog ( Evans et al.||2009) and XMM 
Newton 2"^^ Incremental Source catalog (Watson et al. 



2009| . We searched for the nearest Chandra (XMM) 
source within a 5" search radius from the candidate. We 
only selected the source as a counterpart if there existed 
no other Chandra (XMM) sources within the search ra- 
dius. Nevertheless, most of the X-ray counterparts were 
placed within a 3" distance from the candidates. 

As a result, we found 60 X-ray counterparts. It is 
known that QSOs show higher X-ray to optical flux ra- 
tios than typical galaxies or stars, fx /.fr, owing to the 



accretion on the central black holes (Reeves & Turner 



2000 



Hornschemeier et al. 2001( 1 . To calculate fx/ fr, we 



first derived the my and m/j [i.e. standard Johnson's V 
and Kro n— Cousins i?) using the MACHO B and R mag- 
nitudes (lAlcock et al. 11999 IKunder & Chaboyer||2008 ) . 



We then converted the my and mp to SUSS r magm- 
tude using the formula from the SDSS websittp°] ( Lupton] 
et al. 2005 1 . Note that this formula was deriveonot basea 
not on QSOs but on photometric standard stars (jStetson| 
2000| . Thus the converted SDSS r magnitudes of QSOs 
could have larger errors (i.e. standard deviation) than 
the estimated errors for the standard stars, a ~ 0.01. 
Nevertheless , we fin ally used the following equation from 
Green et al.| ([20041) to derive log(/x//r): 



log^ = log fx 

Jr 



0.4r + 5.67, 



(6) 



where fx is the X-ray flux in units of ergs cm~^ s~^ 
in the range of 0.5-2.0 keV, which is extracted from the 
Chandra and XMM catalogj^ fr is the optical flux and 
r is the converted SDSS r magnitude. 

The top panel of Figure [s] shows the fx j fr of 60 coun- 
terparts with the Chandra and XMM catalogs. The x- 
axis is log{fx/fr), and the y-axis is the converted r mag- 
nitude. In the panel, we also show 16 known MACHO 
QSOs that have X-ray counterparts. The black marks 
are the MACHO QSO counterparts, and the gray marks 



http: / /www. sdss.org/dr47algorithms / sdssUBVRITransform.html 
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ergs cm^" s^^, which is the unit for the Chandra sources, 
is identical with fO^'' Watt m~^, which is the unit for the XMM 
sources. 
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Fig. 5. — Mid-IR color-color and color-magnitude diagrams of the Spitzer SAGE counterparts crossmatched with the QSO candidates. 
Each axis of the fig ure is either Spitzer magnitude or color. All sources inside the region A, B, QSO and YSO are potential QSOs | |Kozlowski| 
|fc Kochanek|2009| . The majority of the candidates are inside the region A and QSO, which is the most promising QSO regions. 

ray catalog. We selected the field so that it overlapped 
with the Chandra footprints. In the top panel of Figure 
[8 we show the fx/ fr of the 21 crossmatched MACHO 
objects (black dots). These counterparts could be ei- 
ther stars or AGNs, although they are most likely X-ray 
emitting stars su ch as X-ray binaries, W-UMa binaries 
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Fig. 6. — Examples lightcurves of the QSO candidates outside the 
AGN region. The x-axis is MJD, and the y-axis is V magnitude. 
All of them show strong and non-periodic flux variation. These 
QSO candidates could either broad or narrow emission line QSOs 
although they are outside the AGN region. 



are the QSO candidate counterparts. The squares are 
XMM counterparts, and the triangles are Chandra coun- 
terparts. The dashed line corresponds to fx/ fr = 0.1, 
which is th e criterion separati ng AGNs and typical galax- 
ies or stars ( Green et al.|2004" |. The two dash-dotted lines 
are boundaries of the contusion area shown as the dashed 
area in the bottom panel (see the following paragraph). 
As the figure shows, most of the MACHO QSOs (75.0%; 
12 out of 16) and our QSO candidates (73.3%; 44 out 
of 60) show higher fx/fr than 0.1. If all the candidates 
with higher fx/fr than 0.1 are QSOs, the false positive 
rate is 27.3%. 

In addition, to estimate how a large portion of non- 
AGNs could have fx/fr > 0.1, we crossmatched all the 
objects from one MACHO field with the Chandra X- 



( |Chen et al.|2006[ ), Algol type binaries (Singh ct al. 199^ 
and c ataclysmic variable stars (e.g. see Wonnacott et al] 



1994[ ) since the number density of such stars surpasses 
tHenumber density of AGNs. Of the 21 MACHO ob- 
jects, 16) have fx / fr smaller than 0.1, which implies that 
non-AGN objects generally have smaller fx/fr than 0.1. 
The remaining five objects have fx/fr larger than 0.1 
and could be AGN candidates. We show the lightcurves 
of these five objects in Figure 9j As the figure shows, they 
do not manifest any strong flux variation and thus were 
not selected as QSO candidates by our selection method. 

Based on the crossmatching results mentioned in the 
previous paragraphs, we further improved the region of 
confidence using the histogram of log(/x//r) shown in 
the bottom panel of Figure [sj The x-axis is \og{fx/fr), 
and the y-axis is normalizea count. The solid line with 
light gray is the histogram of the QSO candidate counter- 
parts, the dashed line with medium gray is the histogram 
of the MACHO QSO counterparts and the dotted line 
with dark gray is the histogram of the 21 MACHO object 
counterparts. The dashed area shows the confusion area 
where stars and QSOs could be mixed together. Con- 
sidering all the histograms, we modified the confidence 
regions as: 

• log(/x//r) < ^1-5 : the non-QSO area 

In this region, log(/x//r) is much smaller than the 
AGN criterion of ^og(fx/fr) = —1- Thus the can- 
didates in this region are not likely QSOs. There 
are only 1 out of 16 (6.2%) MACHO QSOs, 4 out 
of 21 (19%) MACHO objects, 7 out of the 60 (11%) 
candidates inside this region. 

• —1.5 < fx/fr < —0.5 : the confusion area that is 
a mixture of stars and QSOs 

Most of the MACHO objects (76.2%; 16 out of 21) 
are in this region. More than a half of the MACHO 
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Fig. 7. — Mid-IR colors of the Spitzcr SAGE counterparts with the known MACHO variable stars and the MACHO field sources. Each 
axis of the figure is cither Spitzcr magnitude or color. The black scjuares are MACHO QSOs, the gray diamonds arc Be stars, the black 
crosses are variable stars including RR Ryraes, Cepheids, eclipsing binaries, LPVs and blue variable stars. The gray dots are MACHO field 
sources. Almost all MACHO QSOs are inside the region A, B, QSO and YSO, which indicates the mid-IR selection criteria is efficient at 
confirming QSOs. However there are a lot of other variable stars including Be stars inside the regions as well. Thus the mid-IR selection 
might not be practical for selecting QSO candidates. 



QSOs (55.3%; 9 out of 16) and 32 out of the 60 QSO 
candidates (53.3%) are also in this region. 

• !xl!v > -0.5 : the QSO area 

Most of the candidates in this region would be 
QSOs because of their high fx/fr- As the his- 
togram shows, only 1 out of 21 (5%) MACHO ob- 
jects is in this region while 6 out of 16 (37.5%) 
MACHO QSOs and 21 out of 60 (35%) candidates 
are inside the region. 

As we mentioned above, 21 out of the 60 candidates 
are inside the QSO area and are likely true QSOs, which 

gives the upper bound of the false positive rate, 65.0% 
(39/60). In addition, some of the 32 candidates inside 
the confusion area could be also QSOs because more than 
half of the known MACHO QSOs are inside the confusion 
area. Thus the lower bound of the false positive rate is 
11.7% (7/60). 

7. ONGOING AND FUTURE WORKS 

We will observe the QSO candidates with spectroscopic 
instruments to check whether they arc QSOs. Based on 
the projection of the models and the crossmatching re- 
sults, we expect at least several hundred candidates to 
turn out to be QSOs. 

Using the confirmed QSOs and the false positives, we 
will improve our model. The current model is con- 
structed based on the relatively small mimber of known 
QSOs (i.e. 58 known MACHO QSOs), which may be too 
small a sample to represent the true variability character- 
istics of all QSOs in the MACHO database. Thus using 
a large number of QSOs (i.e. more than a few hundreds) 
would help improve the models. 

In addition, our model is effective at selecting not only 
QSOs but also other types of variable sources. Prelimi- 
nary tests showed that recall and precision for periodic 
variables such as RR Lyraes, Cepheids and eclipsing bi- 
naries, were almost 100%; for LPVs, microlensing events 
and Be stars, recall and precision were 80%. 



8. SUMMARY 

In this paper, we presented a new QSO selection algo- 
rithm based on 11 time series features and a supervised 
classification. We first introduced 11 time series features 
to quantify variability characteristics of lightcurves. We 
then tised Support Vector Machine (SVM) to train a clas- 
sification model which separates QSOs from other types 
of variable stars and non-variable stars. Using the train- 
ing set of the MACHO variables (128 Be stars, 582 mi- 
crolensing events, 193 eclipsing binaries, 288 RR Lyraes, 
73 Cepheids and 365 LPVs) , 4,288 non- variables and the 
58 known MACHO QSOs, we trained the models for each 
MACHO B and R band. The trained model correctly 
identified about 80% of the MACHO QSOs with 25% 
false positive rates on a cross-validation test. The ma- 
jority of false positives during the training were Be stars 
known to show variability similar to with QSOs. 

We appHed the model to the whole MACHO LMC 
database consisting of 40 million lightcurves (i.e. 20 mil- 
lion from each MACHO band) in order to select QSO 
candidates. As a result, we found 1,620 candidates from 
the MACHO LMC database. During the selection, none 
of the known MACHO variables were mis-selected as 
QSO candidates. To estimate the true false positive rate 
of the QSO candidates, we crossmatched the candidates 
with astronomical catalogs, including the Spitzer SAGE 
LMC catalog and some X-ray catalogs. The crossmatch- 
ing results confirmed that most of our candidates are 
promising QSO candidates. For instance, the majority 
of candidates with Spitzer counterparts are inside the 
AGN region that is defined by a mid-IR color cut and 
is known to be effective in confirming QSO candidates. 
The crossmatching with X-ray catalogs shows that most 
of the X-ray counterparts have fx/fr > 0.1 and therefore 
are likely QSOs. 

In addition, during the crossmatching with the SAGE 
LMC catalog, we found that using only the mid-IR color 
cut is not a very efficient method in selecting QSO can- 
didates, although it is an effective method in confirming 
QSOs. This suggests that selection methods using vari- 



12 



CandidateiXMM 
Candidate:Chandra 
MACHO 0SOe:XMM 
MACHO QSOsiChandra 
MACHO objects 



-1.0 -0.5 0.0 



0.5 



ability characteristics of lightcurves, including ours, are 
important to further remove false positives, both vari- 
ables and non-variables. 
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Fig. 8. — fx/fr of the X-ray counterparts with the MACHO 
QSOs, the QSO candidates and the MACHO field objects. The 
top paneh the x-axis is log{fx/fr), and the y-axis is the converted 
SDSS r magnitude. The squares are the XMM counterparts and 
the triangles are the Chandra counterparts. The black marks are 
the MACHO QSOs and the gray marks are the candidates. The 
gray dots are the MACHO field objects. The dashed line is the 
criterion between AGN and others such as galaxies and stars. Most 
of the MACHO QSOs and the candidates have higher fx /fr than 
the criterion while most of the MACHO field objects have smaller 
fx/fr than the criterion, which implies most of the candidates 
are promising QSO candidates. The bottom panel: the histogram 
of fx/fr- The X-axis is log(/x//r), and the y-axis is normalized 
count. Based on the histogram, we refined the region of confidence. 
See the text for details. 
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APPENDIX 

In this appendix, we introduce the 11 time series features including four new features that we have developed for 
this work and the remaining seven features. 

Four new time series features : 

• Three autocorrelation Indices 
These three indices are based on the autocorrelation function. The autocorrelation function is defined as: 



N- 



ACir) = 



{N - r) ct2 



{mi - m){mi+r - m), 



(7) 



where N is the total number of data points, t — 1,2, . . . , N — 1 is the time lag, a is the standard deviation, m 
is the magnitude, i is the index for each data point and rh is the mean magnitude. Figure 10 shows the AC{t) 
for various types of variables and non- variables extracted from the MACHO database. NoteTnat, in each panel, 
we show the AC{t) of multiple objects of that type to demonstrate the overall AC{t) patterns. We used more 
than 50 objects of non- variables, RR lyraes, Cepheids, eclipsing binaries and microlensing events. The overall 
AC{t) patterns were preserved even if we used more objects (i.e. several hundreds). For long period variables 
(LPVs), Be stars and QSOs, we used about 10 object of each type to show individual AC{t) pattern. The x-axis 
is the time lag, r in days and the y-axis is the autocorrelation value. As the figure shows, non-variables and all 
periodic variables b ut LPVs show different AC{t) patterns from QSOs, Be stars, LPVs and microlensing events. 



Schild et al. (2009) also noticed that the AC{t) could be useful for discovering QSOs. Thus, by quantifying the 



we can separate certain types of variables 
features that we are using to quantify AC{t). 



In the following paragraphs, we introduce three time series 



^above^ ^below 

We constructed empirical boundary lines on the AC vs r diagram to separate non-variables and periodic 
variables from others. To do so, we calculated the average and standard deviation of the autocorrelation 
functions for non- variables and periodic variables (except LPVs), for each time lag r. We then constructed 
upper and lower boundary lines to be ±4(t from the average line. Figure 11 shows the calculated upper and 
lower boundary linef^ To derive Nabove and Nieiow for each lightcurve, we counted the number of points 
above, Nabove, and number of points below, A^be/oMu these lines. 

Stetson K 
Stetson K (Eq. 



11) was defined to observe the distribution of measurements between the maximum and 



minimum values oi the measurements (Stetson 1996). For details including the definition of Stetson K, see 
the Appendix. We used Stetson K to characterize the different AC{t) patterns, Stetson Kac- 



• Rn 



Res is the range of a cumulative sum ( Ellaway]|1978 ) of each lightcurve and is defined as: 

We removed fluctuated data points using moving average. 
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Fig. 10. — Set of autocorrelation functions of variable and non-variable stars. The x-axis is the time lag, r, in days, and the y-axis is 
the autocorrelation function value. Non-variable stars, Cepheids, eclipsing binaries and RR Lyraes show different patterns from QSOs, Be 
stars, LPVs and microlensing events. 



Res = max(S') — inin(S'), 

1 I (8) 



N a . ^ 



where max (min) is the maximum (minimum) value of S and I = 1,2, ... ,N. Res is typically large for LPVs, 
microlensing events, Be stars and QSOs while it is relatively small for non- variables and other periodic variables 
such as RR Lyraes, Cepheids and eclipsing binaries. 
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Fig. 11. — Two boundary lines constructed using autocorrelation functions of non- variable stars, eclipsing binaries, RR Lyraes and 
Cepheids. The x-axis is the time lag, r, in days, and the y-axis is the autocorrelation value. Based on the lines, we derived N^bove a^nd 
^below See the text for details. 

Other seven time series features : 

(T 

• — 

TO 

This is a simple variabihty index and is defined as the ratio of the standard deviation, a, to the mean magnitude, 
TO. If a hghtcurve has strong variabihty, a/rfi of the hghtcurve is generally large. 

• Period and Period S/N 



Scargle|l982 


Press & Rybicki|l989 


Press et al.ll992j) 


covers not only short period variab 


le stars such as t 



E 1992|) . We search for periods between 0.1 and 1000 day^^*^} which 
I as KR Lyraes, Cepheids and eclipsing binaries but also LPVs. 



Among the detected periods, we s elected the period with the high est S/N. The S/N of each period is calculated 
based on the Lomb periodogram ( Scargle||l982 Press et al.||1992 ). 



• Stetson L 

Stetson L variability index ( Stetson|1996 1 describes the synchronous variability of different bands and is defined 



as: 



L ^ 



JK 
0.798 



(9) 



where J and K are different Stetson indices. Stetson J is calculated based on two simultaneous lightcurves of a 
same star (e.g. MACHO B and R bands) and is defined as: 



1 ^ 



P% = Sp{i) Sq{i), 



(10) 



Spit) 



N 



N -1 



where i is the index for each data point, N is the total number of data points, sgn(Pi) is the sign of Pi and to is 
the magnitude, p and q indicate two different bands. Up^i is the standard error of ith magnitude of band p. In 
case of the MACHO database, p and q indicate the MACHO B and R bands. To derive J from each MACHO 
time series, we used only the data points which have observations from both MACHO B and R bands at the 
same epoch. 



' We used |vARTOOLs| ( |Hartman et al.|2008| for deriving periods 



and period S/Ns. 
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Steston K is calculated using a single band lightcurve and is defined as: 



K 



1 ELI^( 



(11) 



It is k nown that K = 0.900 for a pure sinusoid and 0.798 for a Gaussian distribution. For details, see |Stetson| 
(p96|. 



In brief, Stetson L is generally large for achromatic variable sources and small for non-variables or chromatic 
variables. 



V 

Variability index 77 is the ratio of the mean of the square of successive differences to the variance of data points. 
The index was originally proposed to check whether the successive data points are independe nt or not. In other 
words, the index was developed to check if any trends exist in the data (von Neumann|[l941 ). It is defined as: 



1 



1=1 



(12) 



The index has been substantially investigated by several authors (see von Neumann 1941 Press 1969 and ref- 
erences therein). In brief, if there exists positive serial correlation, 77 is relatively small. On the other hand, if 
there exists negative serial correlation, 77 is large. |Shin et a l. (2009^) used ry to select variable candidates from the 
Northern Sky Variability Survey database ( Wozniak et ai.||20U4a[ ). 

As the top right panel of Figure [T] shows, 77 is relatively small for the variables which have positive autocorrelation 
such as QSOs, Be stars and LPVs. Non-variables or microlensing events show large ry since they do not have 
strong positive correlation. In the cases of other periodic variables such as RR Lyraes, Cepheids and eclipsing 
binaries, rj is also relatively large even though they are periodic variables and therefore have positive correlation. 
This is because 1) we derive 77 not from the folded MACHO lightcurves but from the original lightcurves and 
2) MACHO observed a field a few times per week, which is not enough to reveal positive correlation for small 
time scales. In other words, most raw MACHO lightcurves of the periodic variables do not have strong positive 
correlation and thus have large rj. 

• B- R 

We used an average color for each MACHO lightcurve as: 



B - R = niB^, - ruR^j , (13) 
where ffiB^,, mean magnitudes of MACHO B, R bands. 

Color infor mation. B — R, is useful iri separating OSOs from s ome other types of variables as several other studies 



suggested (Giveon et al. 



1999 



Eyer 



2002 



Geha et al. 



2003). Nevertheless it is known that coloip^ is not very 



efficient discriminator tor seiectmg i ntermediate redshitt (JSOs (i.e. 2.5 < z < 3.0) although it is'efficient for 
selecting high and low redshift QSOs (Richards et al. 2006 Schmidt et al.|2010 ). Note that we used not only color 
information but also other multiple time series features derived solely based on the variability characteristics of 
lightcurves, which helps to identify intermediate redshift QSOs as well as high and low redshift QSOs. 

Con 



The index was introduced for the selection of variable stars from the OGLE database (Wozniak 2000). To 
calculate Con, we counted the number of three consecutive data points that are brighter or fainter than 2a 
and normalized the number by N — 2. Con is close to zero for non- variable stars while it is relatively large for 
variables. In addition. Con is relatively large for the long-time scale varying sources such as LPVs because such 
variables tend to have plenty of consecutive data points bigger than 2a. 



SDSS u - g color. See |Schmidt et al.|20l"ol for details 



